{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Distributed Training\n",
    "\n",
    "**Learning Objectives**\n",
    "  - Use AI Platform Training Service to run a distributed training job\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In the previous notebook we trained our model on AI Platform Training Service, but we didn't recieve any benefit. In fact it was much slower to train on the Cloud (5-10 minutes) than it was to train locally! Why is this?\n",
    "\n",
    "**1. The job was too small**\n",
    "\n",
    "AI Platform Training Service provisions hardware on-demand. This is good because it means you only pay for what you use, but for small jobs it means the start up time for the hardware is longer than the training time itself!\n",
    "\n",
    "To address this we'll use a dataset that is 100x as big, and enough steps to go through all the data at least once.\n",
    "\n",
    "**2. The hardware was too small**\n",
    "\n",
    "By default AI Platform Training Service jobs train on an [n1-standard-4](https://cloud.google.com/compute/docs/machine-types#standard_machine_types) instance, which isn't that much more powerful than our local VM. And even if it was we could [easily increase the specs](https://cloud.google.com/compute/docs/instances/changing-machine-type-of-stopped-instance) of our local VM to match.\n",
    "\n",
    "To get the most benefit out of AI Platform Training Service we need to move beyond training on a single instance and instead train across multiple machines.\n",
    "\n",
    "Because we're using `tf.estimator.train_and_evaluate()`, our model already knows how to distribute itself while training! So all we need to do is supply a `--scale-tier` parameter to the AI Platform Training Service train job which will provide the distributed training environment. See the different scale tiers avaialable [here](https://cloud.google.com/ml-engine/docs/tensorflow/machine-types#scale_tiers). \n",
    "\n",
    "We will use STANDARD_1 which corresponds to  1 n1-highcpu-8 master instance, 4 n1-highcpu-8 worker instances, and n1-standard-4 3 parameter servers. We will cover the details of the distribution strategy and why there are master/worker/parameter designations later in the course. \n",
    "\n",
    "Training will take about 20 minutes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"cloud-training-demos\"  # Replace with your PROJECT\n",
    "BUCKET = \"cloud-training-bucket\"  # Replace with your BUCKET\n",
    "REGION = \"us-central1\"            # Choose an available region for AI Platform Training Service\n",
    "TFVERSION = \"1.14\"                # TF version for AI Platform Training Service to use"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Run distributed cloud job\n",
    "\n",
    "After having testing our training pipeline both locally and in the cloud on a susbset of the data, we'll now submit another (much larger) training job to the cloud. The `gcloud` command is almost exactly the same though we'll need to alter some of the previous parameters to point our training job at the much larger dataset. You can follow these links to read more about [\"Using Distributed TensorFlow with Cloud ML Engine\"](https://cloud.google.com/ml-engine/docs/tensorflow/distributed-tensorflow-mnist-cloud-datalab) or [\"Specifying Machine Types or Scale Tiers\"](https://cloud.google.com/ml-engine/docs/tensorflow/machine-types#scale_tiers). \n",
    "\n",
    "Note the `train_data_path` and `eval_data_path` in the Exercise code below as well `train_steps`, the number of training steps.\n",
    "\n",
    "To start, we'll set up our output directory as before, now calling it `trained_large`. Then we submit the training job using `gcloud ai-platform` similar to before. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "OUTDIR = \"gs://{}/taxifare/trained_large\".format(BUCKET)\n",
    "!gcloud storage rm --recursive --continue-on-error {OUTDIR} # start fresh each time\n",
    "!gcloud ai-platform jobs submit training taxifare_large_$(date -u +%y%m%d_%H%M%S) \\\n",
    "    --package-path=taxifaremodel \\\n",
    "    --module-name=taxifaremodel.task \\\n",
    "    --job-dir=gs://{BUCKET}/taxifare \\\n",
    "    --python-version=3.5 \\\n",
    "    --runtime-version={TFVERSION} \\\n",
    "    --region={REGION} \\\n",
    "    --scale-tier=STANDARD_1 \\\n",
    "    -- \\\n",
    "    --train_data_path=gs://cloud-training-demos/taxifare/large/taxi-train*.csv \\\n",
    "    --eval_data_path=gs://cloud-training-demos/taxifare/small/taxi-valid.csv  \\\n",
    "    --train_steps=200000 \\\n",
    "    --output_dir={OUTDIR}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions to obtain larger dataset\n",
    "\n",
    "Note the new `train_data_path` above. It is ~20,000,000 rows (100x the original dataset) and 1.25GB sharded across 10 files. How did we create this file?\n",
    "\n",
    "Go to https://console.cloud.google.com/bigquery and paste the query:\n",
    "<pre>\n",
    "    #standardSQL\n",
    "    SELECT\n",
    "      (tolls_amount + fare_amount) AS fare_amount,\n",
    "      EXTRACT(DAYOFWEEK from pickup_datetime) AS dayofweek,\n",
    "      EXTRACT(HOUR from pickup_datetime) AS hourofday,\n",
    "      pickup_longitude AS pickuplon,\n",
    "      pickup_latitude AS pickuplat,\n",
    "      dropoff_longitude AS dropofflon,\n",
    "      dropoff_latitude AS dropofflat\n",
    "    FROM\n",
    "      `nyc-tlc.yellow.trips`\n",
    "    WHERE\n",
    "      trip_distance > 0\n",
    "      AND fare_amount >= 2.5\n",
    "      AND pickup_longitude > -78\n",
    "      AND pickup_longitude < -70\n",
    "      AND dropoff_longitude > -78\n",
    "      AND dropoff_longitude < -70\n",
    "      AND pickup_latitude > 37\n",
    "      AND pickup_latitude < 45\n",
    "      AND dropoff_latitude > 37\n",
    "      AND dropoff_latitude < 45\n",
    "      AND passenger_count > 0\n",
    "      AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 50)) = 1\n",
    "</pre>\n",
    "\n",
    "Export this to CSV using the following steps (Note that <b>we have already done this and made the resulting GCS data publicly available</b>, so following these steps is optional):\n",
    "<ol>\n",
    "<li> Click on the \"Save Results\" button and select \"BigQuery Table\" (we can't directly export to CSV because the file is too large). \n",
    "<li>Specify a dataset name and table name (if you don't have an existing dataset, <a href=\"https://cloud.google.com/bigquery/docs/datasets#create-dataset\">create one</a>). \n",
    "<li> On the BigQuery console, find the newly exported table in the left-hand-side menu, and click on the name.\n",
    "<li> Click on the \"Export\" button, then select \"Export to GCS\".\n",
    "<li> Supply your bucket and file name (for example: gs://cloud-training-demos/taxifare/large/taxi-train*.csv). The asterisk allows for sharding of large files.\n",
    "</ol>\n",
    "\n",
    "*Note: We are still using the original smaller validation dataset. This is because it already contains ~31K records so should suffice to give us a good indication of learning. 100xing the validation dataset would slow down training because the full validation dataset is proccesed at each checkpoint, and the value of a larger validation dataset is questionable.*\n",
    "<p/>\n",
    "<p/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis\n",
    "\n",
    "Our previous RMSE was 9.26, and the new RMSE is about the same (9.24), so more training data didn't help.\n",
    "\n",
    "However we still haven't done any feature engineering, so the signal in the data is very hard for the model to extract, even if we have lots of it. In the next section we'll apply feature engineering to try to improve our model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
